Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port boost::regex as regex search engine. #722

Closed
wants to merge 4 commits into from

Conversation

atauzki
Copy link
Contributor

@atauzki atauzki commented Oct 4, 2023

code are mainly from other open souce projects. Tested by mingwg-cc and clang-msvc.
but there is a problem compiling with MSVC, with the calling convention __vectorcall is not compatible with boost::regex.

@zufuliu zufuliu added the regex label Oct 4, 2023
@zufuliu
Copy link
Owner

zufuliu commented Oct 4, 2023

Notepad++'s GPL code can't be used here. AnsiDocumentIterator and UTF8DocumentIterator is same as other two iterators inside Document.cxx for NO_CXX11_REGEX?

@atauzki
Copy link
Contributor Author

atauzki commented Oct 4, 2023

They should be the same function as these 2 classes.

@zufuliu
Copy link
Owner

zufuliu commented Oct 5, 2023

I not yet decided how to handle this PR (external regex engine), the builtin engine currently is used for auto-completion (finding words in document):
image

https://github.com/zufuliu/notepad2/blob/7cfbd60490c0edd2d1cb7199a3004fbc5a669b7f/src/EditAutoC.c#L804-L818

Though don't known how, extent builtin engine to add missing syntax would have small binary than using C++, Boost or other engines.

zufuliu added a commit that referenced this pull request Oct 6, 2023
The regex code to find words in current document is actually much slow than
plain find. the removal makes it possible to use external regex engine.
@zufuliu
Copy link
Owner

zufuliu commented Oct 6, 2023

The experimental regex syntax (3~6 times slow than plain find) is removed by 631662f. we can use SCI_OWNREGEX to make custom regex engine (based on the BuiltinRegex at end of Document.cxx), which should result in clean changes (e.g. implement the custom engine in separator files).

It's maybe better to exam/compare other regex libraries for small dependents, small binary, etc.

@atauzki
Copy link
Contributor Author

atauzki commented Oct 6, 2023

PCRE2 or google RE2? PCRE2's library seems to be slightly lighter and faster, RE2 is fastest among these libs but may lack in functionality.

@zufuliu
Copy link
Owner

zufuliu commented Oct 6, 2023

PCRE2 or google RE2?

Not sure. there are others listed on https://handwiki.org/wiki/Software:Comparison_of_regular_expression_engines
Or just make one with std::regex (JavaScript syntax) without code from RESearch.cxx.

@atauzki
Copy link
Contributor Author

atauzki commented Oct 6, 2023

I've seen a post in zhihu and boost's benchmark data. They all shows that the std::regex implementation is very slow.

Oniguruma maybe slower compared to boost::regex. I used it before in EmEditor but it uses the old version.

@zufuliu
Copy link
Owner

zufuliu commented Oct 8, 2023

with BOOST_REGEX_STANDALONE only about 40 header files inside https://github.com/boostorg/regex/tree/develop/include/boost are required. I think boost regex can be integrated with following rough steps:

  • download latest code from https://github.com/boostorg/regex/tags
  • copy include\boost into scintilla, i.e. copy boost folder into scintilla\include
  • modify project files and makefiles, add SCI_OWNREGEX, BOOST_REGEX_STANDALONE and other boost preprocessor definitions
  • add a *.cxx file to implement boost regex wrapper (based on existing C++11 regex code) and CreateRegexSearch() function
  • delete unused headers inside scintilla\include\boost

@atauzki
Copy link
Contributor Author

atauzki commented Oct 8, 2023

okay,in additional there's a vc project for PCRE2. But I have no enough time in weekdays to apply it to another branch and test them.

@zufuliu
Copy link
Owner

zufuliu commented Oct 8, 2023

But I have no enough time in weekdays to apply it to another branch and test them.

integrate other engines are more complicated, most of them only support (UTF-8 encoded) string parameter instead of custom iterator.

@zufuliu
Copy link
Owner

zufuliu commented Oct 8, 2023

Related change for speed: https://sourceforge.net/p/scintilla/feature-requests/1500/

@atauzki
Copy link
Contributor Author

atauzki commented Oct 11, 2023

There's a bug for find previous match, as it always finds next match and keep finding it in that place. The std::regex related code may have the same issue.

zufuliu pushed a commit that referenced this pull request Oct 12, 2023
…ex engine, PR #722.

Also make it possible to build with only `std::regex` when `SCI_OWNREGEX` is defined.
@zufuliu zufuliu mentioned this pull request Oct 12, 2023
@zufuliu
Copy link
Owner

zufuliu commented Oct 12, 2023

@atauzki can you rebase the code on main and change maxTag loop to following (see boostorg/regex#197) to pass CI builds?

		const int maxTag = std::min(static_cast<int>(match.size()), RESearch::MAXTAG);
		for (int co = 0; co < maxTag; co++) {
			search.bopat[co] = match[co].first.Pos();
			search.eopat[co] = match[co].second.PosRoundUp();
		}

@atauzki
Copy link
Contributor Author

atauzki commented Oct 13, 2023

reverse search bugs that still exists:

  1. cannot include new lines.
  2. empty chars like \b returns the last character.(Fixed)

and forward search bug:
\b don't go to next occurance, it just stops at original place.(Fixed)

the ^ and $ behavior is problematic too.

@atauzki atauzki force-pushed the feat_boost_regex branch 2 times, most recently from 071465e to 363d6b8 Compare October 13, 2023 04:12
@zufuliu zufuliu added this to the v4.24.01 milestone Oct 13, 2023
@zufuliu
Copy link
Owner

zufuliu commented Oct 14, 2023

It seems more work is required to make it usable, the speed is slow than expected, ByteIterator is a bit slow than Scintilla's builtin regex, UTF8Iterator is even slow.

@atauzki
Copy link
Contributor Author

atauzki commented Oct 15, 2023

the ^ and $ behavior is problematic too.

The boost::regex_constants::match_not_eol flag doesn't work as expected,
match_not_bol seems wrong placed, causes it just finds the origin place in line start forever.

@zufuliu
Copy link
Owner

zufuliu commented Oct 15, 2023

Basic version (synced from std::regex code) is committed into boost_regex branch (for further development).

@zufuliu zufuliu closed this Oct 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants